read_csv and read_parquet for review #1
Conversation
python/ray/dataframe/io.py (outdated)

```python
from .dataframe import from_pandas, ray, DataFrame


def read_parquet(path, columns=None, npartitions=None, chunksize=None):
```
Let's try to stay with the pandas signature: `pandas.read_parquet(path, engine='auto', columns=None, **kwargs)`.
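For concreteness, a minimal sketch of what conforming to that signature might look like; the body here is an assumption for illustration, not the PR's actual code, and it presumes a from_pandas helper like the one imported above:

```python
import pandas as pd


def read_parquet(path, engine='auto', columns=None, **kwargs):
    # Delegate the actual parsing to pandas (requires pyarrow or
    # fastparquet), then distribute the result across partitions.
    pandas_df = pd.read_parquet(path, engine=engine, columns=columns, **kwargs)
    return from_pandas(pandas_df, npartitions=None)
```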
Makes sense. Where should npartitions and chunksize go?
After some thought:
- The best way to do it is to set defaults, since that keeps the API the same and users don't need to worry about which arguments Ray needs. chunksize and npartitions are still there as keyword arguments for fine-tuning.
- Example:

```python
# inside ray/dataframe/dataframe.py
DEFAULT_NPARTITIONS = 2
DEFAULT_CHUNKSIZE = 20


def set_npartition_default(n):
    global DEFAULT_NPARTITIONS
    DEFAULT_NPARTITIONS = n


def set_chunksize_default(s):
    global DEFAULT_CHUNKSIZE
    DEFAULT_CHUNKSIZE = s


def from_pandas(df, npartitions=None, chunksize=None):
    if not npartitions and not chunksize:
        npartitions = DEFAULT_NPARTITIONS
    ...
```

- This scheme is also future-proof: if we later devise a way to set the default chunksize automatically for users, we can do it dynamically inside set_npartition_default.
I think this makes a lot of sense. We have typically been creating 1x the number of virtual CPUs, but it has been manual. Ray knows how many CPUs it can access, so we just need to use that. We can tune this a bit ourselves for the average case and leave the rest to users.
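A hedged sketch of that idea, assuming `ray.cluster_resources()` is available and reusing the set_npartition_default helper from the example above:

```python
import ray


def set_npartition_default_from_cluster():
    # Ray reports the resources it can access; default to one partition
    # per virtual CPU, matching the 1x heuristic mentioned above.
    num_cpus = int(ray.cluster_resources().get("CPU", 1))
    set_npartition_default(num_cpus)
```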
python/ray/dataframe/io.py (outdated)

```python
return from_pandas(pd.read_csv(BytesIO(to_read)), npartitions=1)


def read_csv(path, npartitions, **kwargs):
```
Let's try to keep the pandas signature: `pandas.read_csv(filepath_or_buffer, sep=', ', delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, iterator=False, chunksize=None, compression='infer', thousands=None, decimal=b'.', lineterminator=None, quotechar='"', quoting=0, escapechar=None, comment=None, encoding=None, dialect=None, tupleize_cols=None, error_bad_lines=True, warn_bad_lines=True, skipfooter=0, skip_footer=0, doublequote=True, delim_whitespace=False, as_recarray=None, compact_ints=None, use_unsigned=None, low_memory=True, buffer_lines=None, memory_map=False, float_precision=None)`.
If there are things that aren't implemented yet, you can just raise a NotImplementedError for them.
😱. These can all be passed through to pd.read_csv(...), so there won't be any NotImplementedError.
But the same question again about npartitions and chunksize.
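A minimal sketch of that pass-through, assuming a from_pandas helper as above (the signature here is illustrative, not the PR's final one):

```python
import pandas as pd


def read_csv(path, npartitions=None, chunksize=None, **kwargs):
    # Every pandas keyword (sep, header, dtype, ...) is forwarded
    # verbatim, so nothing needs to raise NotImplementedError.
    pandas_df = pd.read_csv(path, **kwargs)
    return from_pandas(pandas_df, npartitions=npartitions, chunksize=chunksize)
```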
See the update in read_parquet above.
I agree about all the issues regarding the number of parameters, but our eventual goal is `import ray.dataframe as pd`, with that being the only change users need to make.
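The intended usage would then look like this; the file name is hypothetical:

```python
import ray.dataframe as pd  # the only line that changes vs. plain pandas

df = pd.read_csv("data.csv")  # same call signature as pandas.read_csv
```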
python/ray/dataframe/test/test_io.py (outdated)

```python
@pytest.fixture
def ray_df_equals_pandas_wo_index(ray_df, pandas_df):
```
Do we need an `__eq__()` on the index to check that they are exactly the same?
The issue with the index is that pandas keeps a continuous range index, while the Ray dataframe's partitions each start their index at 0. I'll change the test back to ray_df_equals_pandas once the new index scheme is merged.
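For reference, a sketch of what the index-agnostic comparison could look like, assuming the Ray dataframe exposes a to_pandas()-style conversion (that helper name is an assumption):

```python
def ray_df_equals_pandas_wo_index(ray_df, pandas_df):
    # Compare values and columns only; drop both indexes so the
    # per-partition restart at 0 doesn't cause spurious mismatches.
    return ray_df.to_pandas().reset_index(drop=True).equals(
        pandas_df.reset_index(drop=True))
```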
nvm. Just saw it merged. Working on refining this now.
We can have an `Index.equals(other)` and an `__eq__(other)` that do different things based on the type of other.
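A rough sketch of that dual behavior, assuming a custom Index class (all names here are illustrative, not the project's actual API):

```python
class Index:
    def __init__(self, values):
        self.values = list(values)

    def equals(self, other):
        # Strict whole-index check: same labels, same order.
        return isinstance(other, Index) and self.values == other.values

    def __eq__(self, other):
        if isinstance(other, Index):
            # Elementwise comparison against another Index.
            return [a == b for a, b in zip(self.values, other.values)]
        # Anything else is treated as a scalar and broadcast.
        return [v == other for v in self.values]
```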
Strange, it looks like GitHub ate my comments from yesterday, so I just resubmitted them.